Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical system, namely acoustic (AM), pronunciation (PM) and language (LM) models, into a single neural network. In this work, we look at one such sequence-to-sequence model, namely listen, attend and spell (LAS), and explore the possibility of training a single model to serve different English dialects, which simplifies the process of training multi-dialect systems without the need for separate AM, PM and LMs for each dialect. We show that simply pooling the data from all dialects into one LAS model falls behind the performance of a model fine-tuned on each dialect. We then look at incorporating dialect-specific information into the model, both by modifying the training targets by inserting the dialect symbol at the end of the original grapheme sequence and also by feeding a 1-hot representation of the dialect information into all layers of the model. Experimental results on seven English dialects show that our proposed system is effective in modeling dialect variations within a single LAS model, outperforming a LAS model trained individually on each of the seven dialects by 3.1 ~ 16.5% relative.
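The two conditioning ideas described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dialect codes, function names, and feature dimensions are all assumptions made for the example.

```python
import numpy as np

# Illustrative dialect codes; the paper uses seven English dialects but does
# not fix these identifiers.
DIALECTS = ["us", "gb", "in", "au", "ca", "nz", "za"]

def make_targets(graphemes, dialect):
    """Append a dialect symbol to the end of the grapheme target sequence."""
    return list(graphemes) + [f"<{dialect}>"]

def one_hot(dialect):
    """Encode the dialect as a 1-hot vector."""
    v = np.zeros(len(DIALECTS), dtype=np.float32)
    v[DIALECTS.index(dialect)] = 1.0
    return v

def condition_features(layer_input, dialect):
    """Concatenate the 1-hot dialect vector onto every frame of a layer's
    input, one simple way of feeding dialect information into all layers."""
    num_frames = layer_input.shape[0]
    dialect_block = np.tile(one_hot(dialect), (num_frames, 1))
    return np.concatenate([layer_input, dialect_block], axis=-1)
```

For example, `make_targets("cat", "gb")` yields `["c", "a", "t", "<gb>"]`, and conditioning a `(frames, 8)` feature matrix grows it to `(frames, 15)` here, since the 1-hot vector has one entry per dialect.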